Author

Emorie D Beck


Week 4 - Codebooks and Data Documentation

Outline

  1. Documenting your design
  2. Building a codebook
  3. Cleaning your data using codebooks
  4. Problem set and Question time
  • Note: the schedule says we would talk about data import today. I want to focus on codebooks instead, so please read R for Data Science Chapters 8 and 21 to make sure you understand that material.

Documenting your design

  • Documentation is a critical part of open science, but not one we’re really taught
  • Documentation is going to look different for different types of research, but it’s not a hopeless cause to think about common features of documentation
  • Common Documentation:
    • Preregistration
    • Experiment Script (for standardizing across experimenters)
    • Survey / experimental files / stimuli / questions
    • Codebooks of all variables collected
    • Codebooks of variables used in a given study
  • In this workbook, I want to touch on three things:
    • Preregistration (brief, mostly focusing on pointing you to resources)
    • Protocol and design flow
    • Codebooks of variables used in a given study (and how to use it in R)

Preregistration

  • Preregistration:
    • Specifying your study design, research questions, hypotheses, data cleaning, analytic plan, inference criteria, and robustness checks in advance
  • Why should you preregister?
    • Badges are fun
    • Preregistrations are not rigid but a chance to think through the questions you want to ask and answer and the challenges that might arise in doing so
    • Builds trust in the scientific process

  • Preregistration is hard
    • Specifying your plan in advance takes considerable effort and time, which can feel like very slow science
  • Preregistration is worthwhile
    • But preregistering plans, code, etc. can speed up the analytic portion of your research workflow, which builds great momentum for writing and submitting projects

What should I preregister?

  • It depends on the project. Some examples include the study design, individual research projects, etc.
    • Study design: A large survey is collected or a multi-part experiment is conducted. Measures, design, some research questions and hypotheses are specified a priori
    • Individual paper / project: A single-part survey or experiment is conducted or a specific piece of a multi-part study is investigated. If part of a multi-part study/experiment, should be linked to the parent preregistration

Learning More:

Protocol and Design Flow

  • Procedure sections in scientific papers are meant to map out, as concisely and simply as possible, how data were obtained (adhering to human subjects ethical codes, etc.)

  • But such sections are not sufficient to replicate or reproduce research because study designs are much more intricate and include many more details than what fits in a method section

    • e.g. measures not used because they weren’t focal, the code that underlies how data are collected, preprocessing, etc.
  • As researchers, it’s our job to make sure that the work we do is documented so well that someone could replicate our studies.

  • Think of it sort of like doing your taxes. You want to keep enough information that if you were audited, you would be able to quickly and easily provide all the relevant information.

  • What you need to document will depend on the kind of work you do.

  • As an example, in my ecological momentary assessment work, I do the following:

    • Preregister the design
    • Write a methods section that includes text for every measure included in any part of the study as well as an extended and detailed procedure description. This also includes information on how data will be cleaned and composited
    • Detailed codebook including all measures that were collected, regardless of whether I have research questions or hypotheses for them. This is shareable for anyone who wants to use the data
    • Make a technical workflow. This documents how all documents, scripts, etc. work together to produce the final result, including what is automated, what requires researcher action, etc.
    • Comment all code and documents extensively
    • Deviations document, where I document every deviation from my initial plans after the design is complete and data begin to be collected (or analyses start)
  • Extensive documentation is also an investment in future you! My measures and procedures sections basically write themselves, and my analytic plan is written in the preregistration

  • This means both that I’m faster and more efficient at writing these and that I feel more confident about the design choices I made, which is a win-win

Codebooks

  • For me, codebooks are the most essential and important part of any research project
  • Codebooks allow me to:
    • parse through documentation and find all the variables I want
    • document detailed information about each of those variables
    • make cleaning and compositing choices for each (e.g., renaming, recoding, removing missings, etc.)
    • differentiate among the kind of variables I have (e.g., predictors, outcomes, covariates, manipulations, and other categories)
    • Pass all this information into R to aid in data cleaning

Example Codebook

In this case, we are going to use some data from the German Socioeconomic Panel Study (GSOEP), which is an ongoing panel study in Germany. Note that these data are for teaching purposes only, shared under the license for the Comprehensive SOEP teaching dataset, which I, as a contracted SOEP user, can use for teaching purposes. These data represent select cases from the full data set and should not be used for the purpose of publication. The full data are available for free at https://www.diw.de/en/diw_02.c.222829.en/access_and_ordering.html.

For this tutorial, I created the codebook for you (Download) and included what I believe are the core columns you may need. Some of these columns may not be particularly helpful for every dataset.

My Core Codebook Columns

Here are my core columns that are based on the original data:

  • dataset name (dataset)
  • how I categorize the variables (category)
  • how I rename each item (item)
  • how I composite the variables (name)
  • original variable name (old_name)
  • original item text (item_text)
  • original item values (scale)
  • how I will recode each item (in text; recode_desc)
  • how I will recode each item (in R; recode)
  • whether item is reverse coded (reverse)
  • scale minimum (mini)
  • scale maximum (maxi)
  • timeline of variable collection (year or wave)
  • meta name / never changing name (meta)
  1. dataset: this column indexes the name of the dataset that you will be pulling the data from. This is important because we will use this info later on (see purrr tutorial) to load and clean specific data files. Even if you don’t have multiple data sets, I believe consistency is more important and suggest using this.

  2. category: broad categories that different variables can be put into. I’m a fan of naming them things like “outcome”, “predictor”, “moderator”, “demographic”, “procedural”, etc. but sometimes use more descriptive labels like “Big 5” to indicate the model from which the measures are derived.

  3. name: this label is basically one level lower than category. So if the category is Big 5, the name would be, for example, “A” for Agreeableness; for well-being outcomes, it might be “SWB” for subjective well-being. This column is most important and useful when you have multiple items in a scale, so I’ll typically leave this blank when something is a standalone variable (e.g. sex, single-item scales, etc.).

  4. item_name: This is the lowest level and most descriptive variable. It indicates which item in a scale something is. So it may be “kind” for Agreeableness or “sex” for the demographic biological sex variable.

  5. old_name: this column is the name of the variable in the data you are pulling it from. This should be exact. The goal of this column is that it will allow us to select() variables from the original data file and rename them something that is more useful to us.

  6. item_text: this column is the original text that participants saw or a description of the item.

  7. scale: this column tells you what the scale of the variable is. Is it a numeric variable, a text variable, etc.? This is helpful for knowing the plausible range.

  8. recode_text: sometimes, we want to recode variables for analyses (e.g. for categorical variables with many levels where sample sizes for some levels are too small to actually do anything with it). I use this column to note the kind of recoding I’ll do to a variable for transparency.

  9. recode: in this column, I write the R code for the recoding as text. After reading the codebook into R, I parse and evaluate that text to recode each variable.
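
To make the recode column concrete, here is a minimal sketch of how a recode string stored in a codebook could be parsed and evaluated in base R. The variable names (`sex`, `marstat`), the recode strings, and the helper `apply_recode()` are hypothetical and only illustrate the idea. The sketch assumes each string is written in terms of a placeholder `x` for the raw values, which may differ from the convention in the actual codebook.

```r
# Hypothetical codebook rows: `recode` holds R code as text, written in terms of `x`
codebook_toy <- data.frame(
  old_name = c("sex", "marstat"),
  recode   = c("ifelse(x == 2, 1, 0)", "ifelse(x %in% c(1, 2), 1, 0)"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: parse a recode string and evaluate it with `x` bound
# to the raw values
apply_recode <- function(recode, x) {
  eval(parse(text = recode), envir = list(x = x), enclos = baseenv())
}

raw_sex     <- c(1, 2, 2, NA)
sex_rule    <- codebook_toy$recode[codebook_toy$old_name == "sex"]
recoded_sex <- apply_recode(sex_rule, raw_sex)
recoded_sex
```

Because the recoding rule lives in the codebook rather than in the script, changing how a variable is recoded only requires editing one cell. Note that evaluating text as code should be reserved for codebooks you wrote or trust.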

Here are additional columns that will make our lives easier or are applicable to some but not all data sets:

  1. reverse: this column tells you whether items in a scale need to be reverse coded. I recommend coding this as 1 (leave alone) and -1 (reverse) for reasons that will become clear later.
  2. mini: this column represents the minimum value of scales that are numeric. Leave blank otherwise.
  3. maxi: this column represents the maximum value of scales that are numeric. Leave blank otherwise.
  4. year: for longitudinal data, we have several waves of data and the name of the same item across waves is often different, so it’s important to note to which wave an item belongs. You can do this by noting the wave (e.g. 1, 2, 3), but I prefer the actual year the data were collected (e.g. 2005, 2009, etc.)
  5. meta: Some datasets have a meta name, which essentially means a name that variable has across all waves to make it clear which variables are the same. They are not always useful as some data sets have meta names but no great way of extracting variables using them. But they’re still typically useful to include in your codebook regardless.
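
As one illustration of how reverse, mini, and maxi can work together, the sketch below reverse-scores items by reflecting them around the scale range. The toy data and the item names are hypothetical, and this is only one plausible way to use the 1 / -1 coding, not necessarily the approach used later in the course.

```r
# Hypothetical long-format responses, already joined to the codebook columns
responses <- data.frame(
  item    = c("kind", "rude", "rude"),
  value   = c(6, 2, 7),
  reverse = c(1, -1, -1),  # 1 = leave alone, -1 = reverse
  mini    = c(1, 1, 1),    # scale minimum
  maxi    = c(7, 7, 7)     # scale maximum
)

# Reverse-keyed items are reflected within the scale range: (mini + maxi) - value
responses$value_scored <- with(
  responses,
  ifelse(reverse == -1, (mini + maxi) - value, value)
)
responses
```

On a 1 to 7 scale, a reversed 2 becomes 6 and a reversed 7 becomes 1, while items keyed 1 are left unchanged.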

Download Example Codebook

Below, let’s download the codebook we will use for this study, which will include all of the above columns. We’ll load it in later. For now, let’s explore it.

Code
# set the path
wd <- "https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/04-week4-readr"

download.file(
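  url      = sprintf("%s/codebook.xlsx", wd), # single slash between the path and the file name
  destfile = "codebook.xlsx",
  mode     = "wb" # write in binary mode so the .xlsx is not corrupted on Windows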
  )

Codebook Tab

The resulting codebook looks something like this:

  • In addition to the codebook tab, I also document other overarching information in three other tabs

Overview Tab

  • Overview just lists what variables I’m considering as serving different functions (e.g., demographics, covariates, moderators, predictors, outcomes, independent variables, dependent variables, etc.)

Key Tab

  • Key helps me create tables that I’ll be able to use in R to help me rename things. This is super helpful for making final tables and figures!!
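
A minimal sketch of how a Key table can drive renaming for tables and figures is below. The toy key, the `long_name` column, and the `results` data frame are hypothetical stand-ins, so the actual Key tab may be laid out differently.

```r
# Hypothetical Key tab: short names used in code, long labels used in output
key_toy <- data.frame(
  category  = c("Big 5", "Big 5", "out"),
  name      = c("A", "E", "SWB"),
  long_name = c("Agreeableness", "Extraversion", "Subjective Well-Being"),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup: short name -> long label
labels <- setNames(key_toy$long_name, key_toy$name)

# Hypothetical model results that use the short names
results <- data.frame(trait = c("E", "A", "E"), estimate = c(0.12, 0.30, 0.08))

# Replace short names with long labels; factor levels follow the order in the key
results$trait_label <- factor(
  unname(labels[results$trait]),
  levels = key_toy$long_name[key_toy$category == "Big 5"]
)
results
```

Storing the display order in the key means every table and figure lists the traits in the same order without repeating that choice in each script.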

Sample Tab

  • Sample helps other people understand how you’re using the columns. This is generally good practice and also helpful if you have research assistants or collaborators helping you out!

Example

Now, we’re going to walk through an extended example. But first, let’s start with a description of the data.

First, we need to load in the data. We’re going to use three waves of data from the German Socioeconomic Panel Study, which is a longitudinal study of German households that has been conducted since 1984. We’re going to use more recent data from three waves of personality data collected between 2005 and 2013.

Note: we will be using the teaching set of the GSOEP data set. As a result, I will not be pulling from the raw files. I will also not be mirroring the format that you would usually load the GSOEP from because that is slightly more complicated and something we will return to in a later tutorial on purrr after we have more skills. I’ve left that code in the .qmd for now, but it won’t make a lot of sense right now.

Workspace

  • Download the following .zip file with an R project
  • We’re going to walk through this script, and then you will spend the rest of class working on your problem set, which basically does the same with your own data.

Packages

Code
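# load psych and plyr before the tidyverse so that the tidyverse versions
# of shared function names (e.g., mutate(), summarize()) take precedence
library(psych)
library(plyr)
library(tidyverse)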

Codebook

Code
loadWorkbook_url <- function(url, sheet) {
    temp_file <- tempfile(fileext = ".xlsx")
    download.file(url = url, destfile = temp_file, mode = "wb", quiet = TRUE)
    readxl::read_excel(temp_file, sheet = sheet)
}

url <- "https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/04-week4-readr/codebook.xlsx"
codebook <- loadWorkbook_url(url, sheet = "codebook") %>%
  mutate(old_name = str_to_lower(old_name))

key <- loadWorkbook_url(url, sheet = "Key")
traits   <- key %>% filter(category == "Big 5")
outcomes <- key %>% filter(category == "out")
covars   <- key %>% filter(category == "dem")

Data

Code
vars <- codebook$old_name
soep <- read_csv(file = "https://github.com/emoriebeck/psc290-data-FQ23/raw/main/04-workshops/04-week4-readr/soep.csv") %>%
  select(any_of(vars)) # keep vars from codebook; any_of() skips names missing from the data
soep
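
Once the data are subset to the codebook variables, the old_name column can also drive renaming. The sketch below uses base R and hypothetical toy objects (`codebook_toy`, `raw_toy`, and made-up variable names) in place of `codebook` and `soep`. The naming scheme for `new_name` (name, item, and year pasted together) is an assumption for illustration.

```r
# Hypothetical codebook rows: the same item measured in two waves
codebook_toy <- data.frame(
  old_name = c("x05_kind", "x09_kind"),
  name     = c("A", "A"),
  item     = c("kind", "kind"),
  year     = c(2005, 2009),
  stringsAsFactors = FALSE
)
# Assumed naming scheme: trait, item, and year pasted together
codebook_toy$new_name <- with(codebook_toy, paste(name, item, year, sep = "_"))

# Hypothetical raw data with one column that is not in the codebook
raw_toy <- data.frame(
  x05_kind = c(5, 6),
  x09_kind = c(4, 7),
  unused   = c("a", "b")
)

# Keep only the columns listed in the codebook (ignoring any that are absent)
keep  <- intersect(codebook_toy$old_name, names(raw_toy))
clean <- raw_toy[, keep, drop = FALSE]

# Rename each retained column by matching it to its codebook row
names(clean) <- codebook_toy$new_name[match(names(clean), codebook_toy$old_name)]
clean
```

Matching on old_name rather than on column position keeps the renaming correct even if the raw file's column order changes.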